Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
significant margin.
Comment: Under submission as a conference paper
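As a concrete illustration of the strategy described above, here is a minimal PyTorch sketch that pairs a cross-modal N-way matching loss with a term pushing apart embeddings within each modality. The function names, the temperature value, and the instance-level separation term are illustrative assumptions, not the paper's exact objective.

```python
# Hedged sketch: cross-modal matching plus intra-modal separation.
# Assumes (batch, dim) audio/visual embeddings of synchronised pairs.
import torch
import torch.nn.functional as F

def cross_modal_loss(audio, video, temperature=0.07):
    """Matched pairs sit on the diagonal; treat matching as N-way classification."""
    audio = F.normalize(audio, dim=-1)
    video = F.normalize(video, dim=-1)
    logits = audio @ video.t() / temperature
    targets = torch.arange(audio.size(0), device=audio.device)
    return F.cross_entropy(logits, targets)

def intra_modal_separation(feats, temperature=0.07):
    """Push apart embeddings of different samples within one modality."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature
    # Mask self-similarity, then softly penalise the remaining pairs.
    mask = torch.eye(feats.size(0), dtype=torch.bool, device=feats.device)
    return torch.logsumexp(sim.masked_fill(mask, float('-inf')), dim=-1).mean()

def total_loss(audio, video, lam=0.5):
    return (cross_modal_loss(audio, video)
            + lam * (intra_modal_separation(audio)
                     + intra_modal_separation(video)))
```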
Perfect match: Improved cross-modal embeddings for audio-visual synchronisation
This paper proposes a new strategy for learning powerful cross-modal
embeddings for audio-to-video synchronization. Here, we set up the problem as
one of cross-modal retrieval, where the objective is to find the most relevant
audio segment given a short video clip. The method builds on the recent
advances in learning representations from cross-modal self-supervision.
The main contributions of this paper are as follows: (1) we propose a new
learning strategy where the embeddings are learnt via a multi-way matching
problem, as opposed to a binary classification (matching or non-matching)
problem as proposed by recent papers; (2) we demonstrate that the performance of
this method far exceeds the existing baselines on the synchronization task; (3)
we use the learnt embeddings for visual speech recognition in a self-supervised
setting, and show that their performance matches that of representations learnt
end-to-end in a fully-supervised manner.
Comment: Preprint. Work in progress
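Contribution (1) above, multi-way matching instead of binary classification, can be illustrated with a short sketch: each video clip is scored against several candidate audio segments (the synchronised one plus time-shifted distractors) and trained with N-way cross-entropy. The shapes, names, and the convention that the matching segment sits at index 0 are assumptions for illustration.

```python
# Hedged sketch of multi-way audio-visual synchronisation matching.
import torch
import torch.nn.functional as F

def multiway_sync_loss(video_emb, audio_embs):
    """
    video_emb:  (batch, dim)          embedding of a short video clip
    audio_embs: (batch, n_cand, dim)  candidate audio segments; index 0
                                      is the synchronised one (assumption)
    """
    video_emb = F.normalize(video_emb, dim=-1)
    audio_embs = F.normalize(audio_embs, dim=-1)
    # Similarity between the clip and every candidate segment.
    logits = torch.einsum('bd,bnd->bn', video_emb, audio_embs)
    # The matching segment is at index 0 for every example in this sketch.
    targets = torch.zeros(video_emb.size(0), dtype=torch.long,
                          device=video_emb.device)
    return F.cross_entropy(logits, targets)
```

Compared with a binary matching/non-matching objective, scoring all candidates jointly gives the model relative information about every distractor in each update, which is one plausible reading of why the multi-way formulation helps.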
FaceFilter: Audio-visual speech separation using still images
The objective of this paper is to separate a target speaker's speech from a
mixture of two speakers using a deep audio-visual speech separation network.
Unlike previous works that used lip movement on video clips or pre-enrolled
speaker information as an auxiliary conditional feature, we use a single face
image of the target speaker. In this task, the conditional feature is obtained
from facial appearance in a cross-modal biometric task, where audio and visual
identity representations are shared in a latent space. Identities learnt from
facial images force the network to isolate the matched speaker and extract that
speaker's voice from the mixed speech. This solves the permutation problem
caused by swapped channel outputs, which frequently occurs in speech separation
tasks. The proposed
method is far more practical than video-based speech separation since user
profile images are readily available on many platforms. Also, unlike
speaker-aware separation methods, it is applicable to separation with unseen
speakers who have never been enrolled before. We show strong qualitative and
quantitative results on challenging real-world examples.
Comment: Under submission as a conference paper. Video examples:
https://youtu.be/ku9xoLh62
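The conditioning mechanism described above can be pictured with a minimal sketch: a face image is mapped to an identity embedding, which is broadcast over time and concatenated with the mixture spectrogram to predict a target-speaker mask. Every module name and layer size here is a hypothetical placeholder; the actual FaceFilter architecture is not reproduced.

```python
# Hedged sketch of face-conditioned speech separation (all names assumed).
import torch
import torch.nn as nn

class FaceConditionedSeparator(nn.Module):
    def __init__(self, id_dim=512, n_freq=257):
        super().__init__()
        # Maps a face crop to an identity embedding assumed to share a
        # latent space with the speaker's voice embedding.
        self.face_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, id_dim))
        # Predicts a time-frequency mask for the target speaker.
        self.mask_net = nn.Sequential(
            nn.Linear(n_freq + id_dim, 512), nn.ReLU(),
            nn.Linear(512, n_freq), nn.Sigmoid())

    def forward(self, mixture_spec, face_image):
        """mixture_spec: (batch, time, n_freq) magnitude spectrogram
           face_image:   (batch, 3, H, W) still image of the target speaker"""
        identity = self.face_encoder(face_image)            # (batch, id_dim)
        identity = identity.unsqueeze(1).expand(-1, mixture_spec.size(1), -1)
        mask = self.mask_net(torch.cat([mixture_spec, identity], dim=-1))
        return mask * mixture_spec          # target-speaker spectrogram estimate
```

Because the condition is a fixed identity rather than a per-frame signal, the mask is tied to one speaker throughout the clip, which is how the permutation ambiguity between output channels is avoided.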
MIRNet: Learning multiple identities representations in overlapped speech
Many approaches can derive information about a single speaker's identity from
speech by learning to recognize consistent characteristics of acoustic
parameters. However, it is challenging to determine identity information when
there are multiple concurrent speakers in a given signal. In this paper, we
propose a novel deep speaker representation strategy that can reliably extract
multiple speaker identities from overlapped speech. We design a network that
can extract a high-level embedding that contains information about each
speaker's identity from a given mixture. Unlike conventional approaches that
need reference acoustic features for training, our proposed algorithm only
requires the speaker identity labels of the overlapped speech segments. We
demonstrate the effectiveness and usefulness of our algorithm in a speaker
verification task and a speech separation system conditioned on the target
speaker embeddings obtained through the proposed method.
Comment: Accepted in Interspeech 202
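A hedged sketch of the idea, assuming two concurrent speakers: the network emits one embedding per speaker slot, and training uses only speaker labels with a permutation-free assignment so that slot order does not matter. Layer sizes, module names, and the loss arrangement are illustrative assumptions, not the MIRNet specification.

```python
# Hedged sketch: multiple identity embeddings from overlapped speech,
# supervised only by speaker labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiIdentityEncoder(nn.Module):
    """Outputs one identity embedding per concurrent-speaker slot."""
    def __init__(self, n_mels=40, emb_dim=256, n_speakers=1000, max_spk=2):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, 512, num_layers=2, batch_first=True)
        self.heads = nn.ModuleList(
            nn.Linear(512, emb_dim) for _ in range(max_spk))
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, mel):                    # mel: (batch, time, n_mels)
        h, _ = self.encoder(mel)               # (batch, time, 512)
        pooled = h.mean(dim=1)                 # temporal average pooling
        return [head(pooled) for head in self.heads]

def permutation_free_id_loss(model, embeddings, labels):
    """labels: (batch, 2) speaker indices. Try both slot-label assignments
    per example and keep the cheaper one, so slot order does not matter."""
    logits = [model.classifier(e) for e in embeddings]
    ce = lambda lg, lb: F.cross_entropy(lg, lb, reduction='none')
    loss_a = ce(logits[0], labels[:, 0]) + ce(logits[1], labels[:, 1])
    loss_b = ce(logits[0], labels[:, 1]) + ce(logits[1], labels[:, 0])
    return torch.minimum(loss_a, loss_b).mean()
```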
AILTTS: Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech
The quality of end-to-end neural text-to-speech (TTS) systems highly depends
on the reliable estimation of intermediate acoustic features from text inputs.
To reduce the complexity of the speech generation process, several
non-autoregressive TTS systems directly find a mapping relationship between
text and waveforms. However, the generation quality of these systems is
unsatisfactory due to the difficulty in modeling the dynamic nature of prosodic
information. In this paper, we propose an effective prosody predictor that
successfully replicates the characteristics of prosodic features extracted from
mel-spectrograms. Specifically, we introduce a generative model-based
conditional discriminator that encourages the estimated embeddings to carry
highly informative prosodic features, which significantly enhances the expressiveness
of generated speech. Since the estimated embeddings obtained by the proposed
method are highly correlated with acoustic features, the time-alignment of
input texts and intermediate features is greatly simplified, which results in
faster convergence. Our proposed model outperforms several publicly available
models on various objective and subjective evaluation metrics, even when using
a relatively small number of parameters.
Comment: Submitted to INTERSPEECH 202
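The conditional discriminator idea can be sketched as follows: a small network scores (text representation, prosody embedding) pairs, learning to distinguish embeddings extracted from mel-spectrograms from those produced by the text-based predictor. The least-squares GAN formulation and all shapes are assumptions; the paper's exact discriminator is not reproduced.

```python
# Hedged sketch of a conditional discriminator over prosody embeddings.
import torch
import torch.nn as nn

class ProsodyDiscriminator(nn.Module):
    """Scores (text_hidden, prosody_emb) pairs as real or predicted."""
    def __init__(self, text_dim=256, pros_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + pros_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))

    def forward(self, text_hidden, prosody):
        return self.net(torch.cat([text_hidden, prosody], dim=-1))

def adversarial_losses(disc, text_hidden, real_pros, fake_pros):
    """Least-squares GAN losses, a common choice for TTS discriminators."""
    d_real = disc(text_hidden, real_pros)           # embeddings from mels
    d_fake = disc(text_hidden, fake_pros.detach())  # embeddings from text
    d_loss = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    g_loss = ((disc(text_hidden, fake_pros) - 1) ** 2).mean()
    return d_loss, g_loss
```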